{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0ffa8e4f",
   "metadata": {},
   "source": [
    "# Q&A with LangChain\n",
    "\n",
    "This notebook demonstrates how to use LangChain to build a chatbot that references a custom knowledge-base. \n",
    "\n",
    "Suppose you have some text documents (PDF, blog, Notion pages, etc.) and want to ask questions related to the contents of those documents. LLMs, given their proficiency in understanding text, are a great tool for this. \n",
    "\n",
    "### [LangChain](https://python.langchain.com/docs/get_started/introduction)\n",
    "[**LangChain**](https://python.langchain.com/docs/get_started/introduction) provides a simple framework for connecting LLMs to your own data sources. Since LLMs are both only trained up to a fixed point in time and do not contain knowledge that is proprietary to an Enterprise, they can't answer questions about new or proprietary knowledge. LangChain solves this problem.\n",
    "\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "    \n",
    "⚠️ The notebook after this one, `03_llama_index_simple.ipynb`, contains the same functionality as this notebook but uses LlamaIndex instead of LangChain. Ultimately, we recommend reading about LangChain vs. LlamaIndex and picking the software/components of the software that makes the most sense to you. \n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65082404",
   "metadata": {},
   "source": [
    "![data_connection](./imgs/data_connection_langchain.jpeg)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b0ea695d",
   "metadata": {},
   "source": [
    "### Step 1: Integrate TensorRT-LLM to LangChain [*(Connector)*](https://docs.llamaindex.ai/en/stable/examples/llm/nvidia_tensorrt.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "f878600a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_nvidia_trt.llms import TritonTensorRTLLM\n",
    "\n",
    "# Connect to the TRT-LLM Llama-2 model running on the Triton server at the url below\n",
    "# Replace \"llm\" with the url of the system where llama2 is hosted\n",
    "triton_url = \"llm:8001\"\n",
    "pload = {\n",
    "            'tokens':500,\n",
    "            'server_url': triton_url,\n",
    "            'model_name': \"ensemble\"\n",
    "}\n",
    "llm = TritonTensorRTLLM(**pload)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d4ea84ce",
   "metadata": {},
   "source": [
    "#### Note: Follow this step for nemotron models\n",
    "1. In case you have deployed a trt-llm optimized nemotron model following steps [here](../RetrievalAugmentedGeneration/README.md#6-qa-chatbot----nemotron-model), execute the cell below by uncommenting the lines. Here we use a custom wrapper for talking with the model server."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1b35872",
   "metadata": {},
   "outputs": [],
   "source": [
    "# from triton_trt_llm import TensorRTLLM\n",
    "# llm = TensorRTLLM(server_url =\"llm:8000\", model_name=\"ensemble\", tokens=500, streaming=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "da50462b",
   "metadata": {},
   "source": [
    "### Step 2: Create a Prompt Template [*(Model I/O)*](https://python.langchain.com/docs/modules/model_io/)\n",
    "\n",
    "A [**prompt template**](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/) is a common paradigm in LLM development. \n",
    "\n",
    "They are a pre-defined set of instructions provided to the LLM and guide the output produced by the model. They can contain few shot examples and guidance and are a quick way to engineer the responses from the LLM. Llama 2 accepts the [prompt format](https://huggingface.co/blog/llama2#how-to-prompt-llama-2) shown in `LLAMA_PROMPT_TEMPLATE`, which we manipulate to be constructed with:\n",
    "- The system prompt\n",
    "- The context\n",
    "- The user's question\n",
    "\n",
    "Langchain allows you to [create custom wrappers for your LLM](https://python.langchain.com/docs/modules/model_io/models/llms/custom_llm) in case you want to use your own LLM or a different wrapper than the one that is supported in LangChain. Since we are using a custom Llama2 model hosted on Triton with TRT-LLM, we have written a custom wrapper for our LLM. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4d22c1c7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.prompts import PromptTemplate\n",
    "\n",
    "LLAMA_PROMPT_TEMPLATE = (\n",
    " \"<s>[INST] <<SYS>>\"\n",
    " \"Use the following context to answer the user's question. If you don't know the answer, just say that you don't know, don't try to make up an answer.\"\n",
    " \"<</SYS>>\"\n",
    " \"<s>[INST] Context: {context} Question: {question} Only return the helpful answer below and nothing else. Helpful answer:[/INST]\"\n",
    ")\n",
    "\n",
    "LLAMA_PROMPT = PromptTemplate.from_template(LLAMA_PROMPT_TEMPLATE)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be6f97af",
   "metadata": {},
   "source": [
    "### Step 3: Load Documents [*(Retrieval)*](https://python.langchain.com/docs/modules/data_connection/)\n",
    "LangChain provides a variety of [document loaders](https://python.langchain.com/docs/integrations/document_loaders) that load various types of documents (HTML, PDF, code) from many different sources and locations (private s3 buckets, public websites).\n",
    "\n",
    "Document loaders load data from a source as **Documents**. A **Document** is a piece of text (the page_content) and associated metadata. Document loaders provide a ``load`` method for loading data as documents from a configured source. \n",
    "\n",
    "In this example, we use a LangChain [`UnstructuredFileLoader`](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file) to load a research paper about Llama2 from Meta.\n",
    "\n",
    "[Here](https://python.langchain.com/docs/integrations/document_loaders) are some of the other document loaders available from LangChain."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1377c60",
   "metadata": {},
   "outputs": [],
   "source": [
    "! wget -O \"llama2_paper.pdf\" -nc --user-agent=\"Mozilla\" https://arxiv.org/pdf/2307.09288.pdf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be053f41",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import UnstructuredFileLoader\n",
    "loader = UnstructuredFileLoader(\"llama2_paper.pdf\")\n",
    "data = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b909dd82",
   "metadata": {},
   "source": [
    "### Step 4: Transform Documents [*(Retrieval)*](https://python.langchain.com/docs/modules/data_connection/)\n",
    "Once documents have been loaded, they are often transformed. One method of transformation is known as **chunking**, which breaks down large pieces of text, for example, a long document, into smaller segments. This technique is valuable because it helps [optimize the relevance of the content returned from the vector database](https://www.pinecone.io/learn/chunking-strategies/). \n",
    "\n",
    "LangChain provides a [variety of document transformers](https://python.langchain.com/docs/integrations/document_transformers/), such as text splitters. In this example, we use a [``SentenceTransformersTokenTextSplitter``](https://api.python.langchain.com/en/latest/sentence_transformers/langchain_text_splitters.sentence_transformers.SentenceTransformersTokenTextSplitter.html). The ``SentenceTransformersTokenTextSplitter`` is a specialized text splitter for use with the sentence-transformer models. The default behaviour is to split the text into chunks that fit the token window of the sentence transformer model that you would like to use. This sentence transformer model is used to generate the embeddings from documents. \n",
    "\n",
    "There are some nuanced complexities to text splitting since semantically related text, in theory, should be kept together. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bfab1d35",
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "from langchain.text_splitter import SentenceTransformersTokenTextSplitter\n",
    "TEXT_SPLITTER_MODEL = \"intfloat/e5-large-v2\"\n",
    "TEXT_SPLITTER_TOKENS_PER_CHUNK = 510\n",
    "TEXT_SPLITTER_CHUNCK_OVERLAP = 200\n",
    "\n",
    "text_splitter = SentenceTransformersTokenTextSplitter(\n",
    "    model_name=TEXT_SPLITTER_MODEL,\n",
    "    tokens_per_chunk=TEXT_SPLITTER_TOKENS_PER_CHUNK,\n",
    "    chunk_overlap=TEXT_SPLITTER_CHUNCK_OVERLAP,\n",
    ")\n",
    "start_time = time.time()\n",
    "documents = text_splitter.split_documents(data)\n",
    "print(f\"--- {time.time() - start_time} seconds ---\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cbe64e4e",
   "metadata": {},
   "source": [
    "Let's view a sample of content that is chunked together in the documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4b48d697",
   "metadata": {},
   "outputs": [],
   "source": [
    "documents[40].page_content"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0243c9b5",
   "metadata": {},
   "source": [
    "### Step 5: Generate Embeddings and Store Embeddings in the Vector Store [*(Retrieval)*](https://python.langchain.com/docs/modules/data_connection/)\n",
    "\n",
    "#### a) Generate Embeddings\n",
    "[Embeddings](https://python.langchain.com/docs/modules/data_connection/text_embedding/) for documents are created by vectorizing the document text; this vectorization captures the semantic meaning of the text. This allows you to quickly and efficiently find other pieces of text that are similar. The embedding model used below is [intfloat/e5-large-v2](https://huggingface.co/intfloat/e5-large-v2).\n",
    "\n",
    "LangChain provides a wide variety of [embedding models](https://python.langchain.com/docs/integrations/text_embedding) from many providers and makes it simple to swap out the models. \n",
    "\n",
    "When a user sends in their query, the query is also embedded using the same embedding model that was used to embed the documents. As explained earlier, this allows to find similar (relevant) documents to the user's query. \n",
    "\n",
    "#### b) Store Document Embeddings in the Vector Store\n",
    "Once the document embeddings are generated, they are stored in a vector store so that at query time we can:\n",
    "1) Embed the user query and\n",
    "2) Retrieve the embedding vectors that are most similar to the embedding query.\n",
    "\n",
    "A vector store takes care of storing the embedded data and performing a vector search.\n",
    "\n",
    "LangChain provides support for a [great selection of vector stores](https://python.langchain.com/docs/integrations/vectorstores/). \n",
    "\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "    \n",
    "⚠️ For this workflow, [Milvus](https://milvus.io/) vector database is running as a microservice. \n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "88520151",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings import HuggingFaceEmbeddings\n",
    "from langchain.vectorstores import Milvus\n",
    "import torch\n",
    "import time\n",
    "\n",
    "#Running the model on CPU as we want to conserve gpu memory.\n",
    "#In the production deployment (API server shown as part of the 5th notebook we run the model on GPU)\n",
    "model_name = \"intfloat/e5-large-v2\"\n",
    "model_kwargs = {\"device\": \"cpu\"}\n",
    "encode_kwargs = {\"normalize_embeddings\": False}\n",
    "hf_embeddings = HuggingFaceEmbeddings(\n",
    "    model_name=model_name,\n",
    "    model_kwargs=model_kwargs,\n",
    "    encode_kwargs=encode_kwargs,\n",
    ")\n",
    "start_time = time.time()\n",
    "vectorstore = Milvus.from_documents(documents=documents, embedding=hf_embeddings, connection_args={\"host\": \"milvus\", \"port\": \"19530\"})\n",
    "print(f\"--- {time.time() - start_time} seconds ---\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "59e4764f",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Simple Example: Retrieve Documents from the Vector Database\n",
    "# note: this is just for demonstration purposes of a similarity search\n",
    "question = \"Can you talk about safety evaluation of llama2 chat?\"\n",
    "docs = vectorstore.similarity_search(question)\n",
    "print(docs[2].page_content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "352708d2",
   "metadata": {},
   "source": [
    " > ### Simple Example: Retrieve Documents from the Vector Database [*(Retrieval)*](https://python.langchain.com/docs/modules/data_connection/)\n",
    ">Given a user query, relevant splits for the question are returned through a **similarity search**. This is also known as a semantic search, and it is done with meaning. It is different from a lexical search, where the search engine looks for literal matches of the query words or variants of them, without understanding the overall meaning of the query. A semantic search tends to generate more relevant results than a lexical search.\n",
    "![vector_stores.jpeg](./imgs/vector_stores.jpeg)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50d81f63",
   "metadata": {},
   "source": [
    "### Step 6: Compose a streamed answer using a Chain\n",
    "We have already integrated the Llama2 TRT LLM with the help of LangChain connector, loaded and transformed documents, and generated and stored document embeddings in a vector database. To finish the pipeline, we need to add a few more LangChain components and combine all the components together with a [chain](https://python.langchain.com/docs/modules/chains/).\n",
    "\n",
    "A [LangChain chain](https://python.langchain.com/docs/modules/chains/) combines components together. In this case, we use  [Langchain Expression Language](https://python.langchain.com/docs/expression_language/why) to build a chain.\n",
    "\n",
    "We formulate the prompt placeholders (context and question) and pipe it to our trt-llm connector as shown below and finally stream the result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8bdb143b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.runnables import RunnablePassthrough\n",
    "import time\n",
    "\n",
    "chain = (\n",
    "    {\"context\": vectorstore.as_retriever(), \"question\": RunnablePassthrough()}\n",
    "    | LLAMA_PROMPT\n",
    "    | llm\n",
    ")\n",
    "start_time = time.time()\n",
    "for token in chain.stream(question):\n",
    "    print(token, end=\"\", flush=True)\n",
    "print(f\"\\n--- {time.time() - start_time} seconds ---\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
